-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: reset_index on a MultiIndex with duplicate levels raises a ValueError #44755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: reset_index on a MultiIndex with duplicate levels raises a ValueError #44755
Conversation
johnzangwill
commented
Dec 4, 2021
- [x ] closes BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError #44410
- [x ] tests added / passed
- [x ] Ensure all linting tests pass, see here for how to run them
- [x ] whatsnew entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a whatsnew note as well
We should discuss this first, @mroeschke and I agreed that we should only add a note to the docs |
By all means! This arose due to @rhshadrach's efforts to break my #44267. I found that a trivial and sensible change to frame.py fixed the problem, and also fixed #44410. I know that you decided to document rather than fix. But since the bug was a regression, and the fix seems simple, I suggest to accept my fix. It will also be a prerequisite for my #44267. |
Imo this change is neither trivial nor simple. Accidentially ending up with duplictated column names is not desireable. This is a real PITA in indexing among others. For example:
|
Agreed on both counts. For #44267, it is the case that the user called the function with duplicate columns, the internals of groupby moved them into the index, and we want to restore them as columns. For this, it would suffice to have an internal version of reset_index where we can override allow_duplicates, i.e. this pattern:
where #44267 could then call
I'm not sure what this means (in other words, what should be changed inside of insert?). In any case, for #44267, I don't think we should be reimplementing reset_index via calling insert directly (to be clear, I'm not suggesting you're saying that). |
What I meant with respect to insert: If we want to respect the flag, we should change the keyword But I would prefer not changing the public behavior, so I like your idea with a private implementation better. |
There is a whole chapter on this: https://pandas.pydata.org/pandas-docs/stable/user_guide/duplicates.html.
It seems to me that if you have a legal Series (with duplicates), then it is OK for |
I personally am not a fan of this resolution. My understanding is that In any case, this an API change that will need to have a FutureWarning before it can be implemented in 2.0 if we go this route. |
Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-01-28 22:44:47 UTC |
I have added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine one comment. ping on greenish
@@ -5792,6 +5805,8 @@ class max type | |||
new_obj = self | |||
else: | |||
new_obj = self.copy() | |||
if allow_duplicates is not lib.no_default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldnt we do the same as you are doing in insert and just set this here e.g.
if allow_duplicates is not lib.no_default:
allow_duplicates = False
.....
simpler & more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Your suggestion has got to be a typo. That change would render the argument always False
and fail my tests.
If I take out the next line, which is type-checking, then I will have to take out my test https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/tests/frame/methods/test_reset_index.py#L474
If I copy in the code from insert
, then the lib.no_default
no longer has any function.
I only used lib.no_default
to avoid your original objections to this argument. I don't like it, because it makes no sense in the documentation and disguises the real default, which is False
.
I would be more than happy to take it out and go back to the original much simpler allow_duplicates: bool = False
, which is the case for insert
in main and for Series
in this PR: https://github.com/johnzangwill/pandas/blob/dddae0734e37ab9fa4ae4a89e68043c4631fd68c/pandas/core/series.py#L1369
@jreback Let me know what you want:
- Leave as is.
- Revert
lib.no_default
and have all the argumentsallow_duplicates: bool = False
cc @rhshadrach, @phofl, @mroeschke. Comments on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @johnzangwill